Parameter Hub: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training
Authors
Abstract
Most work in the deep learning systems community has focused on faster inference, but arriving at a trained model requires lengthy experiments. Accelerating training lets developers iterate faster and come up with better models. DNN training is often seen as a compute-bound problem, best done in a single large compute node with many GPUs. As DNNs get bigger, training must go distributed. Distributed deep neural network (DDNN) training constitutes an important workload on the cloud. Larger DNN models and faster compute engines shift the training performance bottleneck from computation to communication. Our experiments show that existing DNN training frameworks do not scale in a typical cloud environment due to insufficient bandwidth and inefficient parameter server software stacks. We propose PHub, a high performance parameter server (PS) software design that provides an optimized network stack and a streamlined gradient processing pipeline to benefit common PS setups, and PBox, a balanced, scalable central PS hardware that fully utilizes PHub capabilities. We show that in a typical cloud environment, PHub can achieve up to 3.8x speedup over state-of-the-art designs when training ImageNet. We discuss future directions of integrating PHub with programmable switches for in-network aggregation during training, leveraging the datacenter network topology to reduce bandwidth usage and localize data movement.
1 DISTRIBUTED DNN TRAINING IS COMMUNICATION BOUND
The goal of this work is to accelerate distributed DNN training in cloud environments. This work focuses on "data" parallelism, where workers process different samples and share the same model. A training iteration in this paradigm has two main components: computation-heavy forward and backward passes, and a communication-heavy model update step. As DNN models get larger and faster accelerators emerge, the performance bottleneck of distributed DNN training has shifted from computation to communication.
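The communication-heavy model update step in this paradigm is the push/pull exchange with the parameter server: each worker computes gradients on its own minibatch, pushes them to the server, the server aggregates them and applies the update, and workers pull the fresh weights for the next iteration. Below is a minimal single-process sketch of that exchange for a toy linear model; the class and function names are illustrative only and are not PHub's actual API.

import numpy as np

class ParameterServer:
    """Toy in-process parameter server: collects gradients, applies an SGD step."""

    def __init__(self, model_size, lr=0.1):
        self.weights = np.zeros(model_size)
        self.lr = lr
        self.pending = []                      # gradients pushed since the last step

    def push(self, grad):
        # In a real system this is the communication-heavy network transfer.
        self.pending.append(grad)

    def apply_update(self):
        avg_grad = np.mean(self.pending, axis=0)
        self.weights -= self.lr * avg_grad
        self.pending.clear()

    def pull(self):
        return self.weights.copy()


def worker_gradient(weights, shard):
    """Stand-in for one worker's forward/backward pass (linear model, MSE loss)."""
    x, y = shard                               # x: (n, d), y: (n,)
    residual = x @ weights - y
    return 2.0 * x.T @ residual / len(y)


# One synchronous data-parallel run with 4 workers sharing the same model.
rng = np.random.default_rng(0)
d = 8
true_w = rng.normal(size=d)
shards = []
for _ in range(4):                             # each worker owns a different data shard
    x = rng.normal(size=(64, d))
    shards.append((x, x @ true_w))

ps = ParameterServer(model_size=d)
for step in range(200):
    current = ps.pull()                        # workers fetch the current model
    for shard in shards:                       # conceptually runs in parallel
        ps.push(worker_gradient(current, shard))
    ps.apply_update()                          # server aggregates and updates

print(float(np.linalg.norm(ps.pull() - true_w)))   # should be close to 0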
Similar resources
How to scale distributed deep learning?
Training time on large datasets for deep neural networks is the principal workflow bottleneck in a number of important applications of deep learning, such as object classification and detection in automatic driver assistance systems (ADAS). To minimize training time, the training of a deep neural network must be scaled beyond a single machine to as many machines as possible by distributing the ...
Online Job Scheduling in Distributed Machine Learning Clusters
Nowadays large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train a large dataset and derive the prediction/inference model, e.g., a deep neural network, multiple workers are run in parallel to train partitions of the input dataset, and update shared model parameters. In a shared cluster handling multiple tr...
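The data-parallel pattern sketched in this snippet, several workers each training on a partition of the input dataset while updating shared model parameters, starts with assigning shards to workers. A minimal, framework-agnostic sketch follows; the partition helper is purely illustrative and not taken from any of the cited systems.

from typing import List, Sequence, TypeVar

T = TypeVar("T")

def partition(dataset: Sequence[T], num_workers: int) -> List[List[T]]:
    """Round-robin split of a dataset into one shard per worker."""
    shards: List[List[T]] = [[] for _ in range(num_workers)]
    for i, sample in enumerate(dataset):
        shards[i % num_workers].append(sample)
    return shards

# Example: 10 samples spread across 3 parallel workers.
print(partition(list(range(10)), 3))
# -> [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]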
A multi-scale convolutional neural network for automatic cloud and cloud shadow detection from Gaofen-1 images
Reconstructing information contaminated by cloud and cloud shadow is an important pre-processing step for high-resolution satellite images. Automatic segmentation of cloud and cloud shadow can be the first step in that reconstruction. This stage is a considerable challenge due to the relatively inefficient performanc...
Scaling GRPC Tensorflow on 512 nodes of Cori Supercomputer
We explore scaling of the standard distributed TensorFlow [1] with gRPC primitives on up to 512 Intel Xeon Phi (KNL) nodes of the Cori supercomputer [2] with synchronous stochastic gradient descent (SGD), and identify causes of scaling inefficiency at higher node counts. To our knowledge, this is the first exploration of distributed gRPC TensorFlow's scalability on an HPC supercomputer at such...
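One generic reason synchronous SGD can stop scaling at high node counts (not necessarily the cause that study identifies) is that each step waits for the slowest node and then pays a communication cost that grows with the node count. The toy model below illustrates only that shape of the effect; all constants are made-up assumptions, not measurements from the cited work.

import random

def sync_step_time(num_nodes, mean_compute=1.0, jitter=0.2, comm_per_node=0.002):
    """Toy model of one synchronous SGD step.

    The synchronization barrier waits for the slowest node, so compute time is
    the max over nodes; communication cost is modeled as linear in node count.
    """
    node_times = [random.gauss(mean_compute, jitter) for _ in range(num_nodes)]
    return max(node_times) + comm_per_node * num_nodes

random.seed(1)
for n in (8, 64, 512):
    avg = sum(sync_step_time(n) for _ in range(100)) / 100
    print(f"{n:4d} nodes: avg step time {avg:.3f}")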
An adaptive estimation method to predict thermal comfort indices man using car classification neural deep belief
Many experimental and theoretical indices of human thermal comfort and discomfort are calculated from climatic elements such as wind speed, temperature, humidity, and solar radiation as input data. Daily data on temperature, wind speed, relative humidity, and cloudiness for the years 1382-1392 were used. In the first step, the Tmrt parameter was calculated in the Ray...
Journal: CoRR
Volume: abs/1801.09805
Year: 2018